16 research outputs found
Temporal Sentence Grounding in Streaming Videos
This paper aims to tackle a novel task - Temporal Sentence Grounding in
Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance
between a video stream and a given sentence query. Unlike regular videos,
streaming videos are acquired continuously from a particular source and must
often be processed on the fly in applications such as surveillance and
live-stream analysis. TSGSV is therefore challenging: the model must infer
without access to future frames while processing long histories of past frames
effectively, neither of which earlier methods address. To tackle these
challenges, we propose two novel methods: (1) a TwinNet
structure that enables the model to learn about upcoming events; and (2) a
language-guided feature compressor that eliminates redundant visual frames and
reinforces the frames that are relevant to the query. We conduct extensive
experiments using ActivityNet Captions, TACoS, and MAD datasets. The results
demonstrate the superiority of our proposed methods. A systematic ablation
study also confirms their effectiveness.
Comment: Accepted by ACM MM 202
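The language-guided feature compressor is only described at a high level in the abstract. As a rough illustration of the idea (not the paper's actual design), the sketch below scores stored frame features against the sentence-query embedding, keeps only the most relevant frames, and reweights them; the function names, dimensions, and top-k scheme are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_frames(frame_feats, query_feat, keep=4):
    """Score each historical frame by similarity to the query and keep the top-k.

    frame_feats: (T, D) array of per-frame visual features
    query_feat:  (D,) sentence-query embedding
    Returns the kept frame features (reweighted by relevance) and their indices.
    """
    scores = frame_feats @ query_feat          # (T,) query-frame relevance
    weights = softmax(scores)                  # normalize to a distribution
    top = np.argsort(weights)[::-1][:keep]     # most query-relevant frames
    top = np.sort(top)                         # preserve temporal order
    return frame_feats[top] * weights[top, None], top

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # 16 historical frames, 8-dim features
query = rng.normal(size=8)
kept, idx = compress_frames(feats, query, keep=4)
print(kept.shape)  # (4, 8)
```

In a streaming setting, a compressor like this would run repeatedly so that the retained history stays bounded while query-relevant frames are reinforced.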
EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints
Motivated by the superior performance of image diffusion models, more and
more researchers strive to extend these models to the text-based video editing
task. Nevertheless, current video editing methods face a dilemma between high
fine-tuning costs and limited generation capacity. Compared
with images, we conjecture that videos necessitate more constraints to preserve
the temporal consistency during editing. Towards this end, we propose EVE, a
robust and efficient zero-shot video editing method. Under the guidance of
depth maps and temporal consistency constraints, EVE derives satisfactory video
editing results with an affordable computational and time cost. Moreover,
recognizing the absence of a publicly available video editing dataset for fair
comparisons, we construct a new benchmark ZVE-50 dataset. Through comprehensive
experimentation, we validate that EVE achieves a satisfactory trade-off
between performance and efficiency. We will release our dataset and codebase to
facilitate future research.
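The abstract does not spell out the temporal consistency constraints. A common minimal form of such a constraint, shown here purely as an illustration (the function name and latent shape are assumptions, not EVE's actual formulation), penalizes frame-to-frame differences between edited frame latents:

```python
import numpy as np

def temporal_consistency_loss(latents):
    """Penalize frame-to-frame changes in per-frame video latents.

    latents: (T, D) array, one latent vector per frame.
    Returns the mean squared difference between consecutive frames;
    zero when every frame is identical.
    """
    diffs = latents[1:] - latents[:-1]
    return float((diffs ** 2).mean())

static = np.ones((5, 4))                            # identical frames
jitter = np.arange(20, dtype=float).reshape(5, 4)   # steadily drifting frames
print(temporal_consistency_loss(static))   # 0.0
print(temporal_consistency_loss(jitter))   # 16.0
```

A term like this, added to the editing objective, discourages flicker between neighboring frames without requiring any fine-tuning of the underlying diffusion model.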
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
In recent years, the explosion of web videos makes text-video retrieval
increasingly essential and popular for video filtering, recommendation, and
search. Text-video retrieval aims to rank relevant texts and videos higher than
irrelevant ones. The core of this task is to precisely measure the cross-modal
similarity between texts and videos. Recently, contrastive learning methods
have shown promising results for text-video retrieval, most of which focus on
the construction of positive and negative pairs to learn text and video
representations. Nevertheless, they do not pay enough attention to hard
negative pairs and lack the ability to model different levels of semantic
similarity. To address these two issues, this paper improves contrastive
learning using two novel techniques. First, to exploit hard examples for robust
discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module
(DMAE) to mine hard negative pairs from textual and visual clues. By further
introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively
identify all these hard negatives and explicitly highlight their impacts in the
training loss. Second, our work argues that triplet samples can better model
fine-grained semantic similarity compared to pairwise samples. We thereby
present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to
construct partial order triplet samples by automatically generating
fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL
designs an adaptive token masking strategy with cross-modal interaction to
model subtle semantic differences. Extensive experiments demonstrate that the
proposed approach outperforms existing methods on four widely-used text-video
retrieval datasets, including MSR-VTT, MSVD, DiDeMo, and ActivityNet.
Comment: Accepted by ACM MM 202
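The exact form of the NegNCE loss is not given in the abstract. As a hedged sketch of the general idea of a negative-aware InfoNCE, the code below up-weights negatives whose similarity comes close to the positive's; the margin, temperature, and weighting scheme are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def weighted_infonce(sim, pos_idx, hard_weight=2.0, tau=0.07, margin=0.1):
    """InfoNCE loss for one text against N candidate videos, with extra
    weight on hard negatives (negatives within `margin` of the positive).

    sim:     (N,) similarities of one text to N videos
    pos_idx: index of the matched (positive) video
    """
    logits = sim / tau
    weights = np.ones_like(sim)
    hard = sim >= sim[pos_idx] - margin   # negatives close to the positive
    hard[pos_idx] = False
    weights[hard] = hard_weight           # emphasize hard negatives
    m = logits.max()
    denom = (weights * np.exp(logits - m)).sum()
    return float(-np.log(np.exp(logits[pos_idx] - m) / denom))

sim = np.array([0.9, 0.85, 0.1])          # video 1 is a hard negative
plain = weighted_infonce(sim, 0, hard_weight=1.0)
emph = weighted_infonce(sim, 0, hard_weight=2.0)
print(plain < emph)  # True: up-weighting hard negatives raises the loss
```

With `hard_weight=1.0` this reduces to standard InfoNCE; raising it makes near-miss negatives contribute more gradient, which is the intuition behind explicitly highlighting hard negatives in the training loss.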
Automatic Car Damage Assessment System: Reading and Understanding Videos as Professional Insurance Inspectors
We demonstrate an artificial-intelligence-based car damage assessment system for the car insurance field, which can exempt insurance inspectors from checking cars on site and help people without professional knowledge assess car damage when accidents happen. Unlike existing approaches, we use videos instead of photos to interact with users, making the whole procedure as simple as possible. We adopt object detection, video detection, and segmentation techniques from computer vision, and take advantage of multiple frames extracted from videos to achieve high damage-recognition accuracy. The system uploads video streams captured by mobile devices, recognizes car damage on the cloud asynchronously, and then returns the damaged components and repair costs to users. The system assesses car damage automatically and returns results in seconds, which reduces labor costs and significantly decreases insurance claim time.
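The abstract states that multiple frames are combined to raise recognition accuracy but not how. One minimal way to realize that idea, shown here purely as an illustration and not as the paper's actual pipeline, is to report a damage class only when it is detected with sufficient confidence in several frames:

```python
import numpy as np

def aggregate_damage_scores(frame_scores, min_frames=2, threshold=0.5):
    """Aggregate per-frame damage-detection confidences across a video clip.

    frame_scores: (T, C) confidences for C damage classes over T frames.
    A class counts as detected only if its confidence exceeds `threshold`
    in at least `min_frames` frames, suppressing single-frame noise.
    Returns the per-class detection mask and mean confidences.
    """
    hits = (frame_scores > threshold).sum(axis=0)   # frames above threshold
    detected = hits >= min_frames
    mean_conf = frame_scores.mean(axis=0)
    return detected, mean_conf

# 3 frames x 3 hypothetical damage classes (e.g. scratch, dent, crack)
scores = np.array([[0.90, 0.20, 0.60],
                   [0.80, 0.70, 0.40],
                   [0.95, 0.10, 0.55]])
det, conf = aggregate_damage_scores(scores)
print(det)   # [ True False  True]
```

The middle class fires in only one frame and is dropped, which is the kind of robustness that single-photo approaches cannot get from one viewpoint.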
Sperm cells are passive cargo of the pollen tube in plant fertilization
Sperm cells of seed plants have lost their motility and are transported by the vegetative pollen tube cell for fertilization, but the extent to which they regulate their own transportation is a long-standing debate. Here we show that Arabidopsis lacking two bHLH transcription factors produces pollen without sperm cells. This abnormal pollen mostly behaves like the wild type, demonstrating that sperm cells are dispensable for normal pollen tube development.
Maternal ENODLs Are Required for Pollen Tube Reception in Arabidopsis
During the angiosperm (flowering-plant) life cycle, double fertilization represents the hallmark transition between the diploid and haploid generations [1]. The success of double fertilization largely depends on compatible communication between the male gametophyte (pollen tube) and the maternal tissues of the flower, culminating in precise pollen tube guidance to the female gametophyte (embryo sac) and its rupture to release sperm cells. Several important factors involved in pollen tube reception have been identified recently [2-6], but the underlying signaling pathways are far from understood. Here, we report that a group of female-specific small proteins, early nodulin-like proteins (ENODLs, or ENs), are required for pollen tube reception. ENs feature a plastocyanin-like (PCNL) domain, an arabinogalactan (AG) glycomodule, and a predicted glycosylphosphatidylinositol (GPI) anchor motif. We show that ENs are asymmetrically distributed at the plasma membrane of the synergid cells and accumulate at the filiform apparatus, where arriving pollen tubes communicate with the embryo sac. EN14 strongly and specifically interacts with the extracellular domain of the receptor-like kinase FERONIA, which is localized at the synergid cell surface and known to critically control pollen tube reception [6]. Wild-type pollen tubes failed to arrest growth and rupture after entering the ovules of quintuple loss-of-function EN mutants, indicating a central role of ENs in male-female communication and pollen tube reception. Moreover, overexpression of EN15 from its endogenous promoter disturbed pollen tube guidance and reduced fertility. These data suggest that female-derived, GPI-anchored ENODLs play an essential role in male-female communication and fertilization.